Data mining
Data exploration

An example of data exploration can be found on this site. Check the correlation between each pair of variables (cor() for numerical values, pairs() for a plot). Use a histogram (hist(), or plot(density()) for a continuous curve) to get an idea of the distribution, and a boxplot (boxplot()) to see more clearly how skewed the data are.

Feature reduction techniques

* Principal component analysis
* Independent component analysis
* Non-negative matrix factorization
* Singular value decomposition
* Relief-F
* Kruskal-Wallis

A MATLAB toolbox for feature reduction is available here.

Text mining

RTextTools uses tm and randomForest.

Decision trees

CART tree

    library("party")
    # Note: ctree() from party fits a conditional inference tree
    iris_ctree <- ctree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)
    plot(iris_ctree)

Random forest

    library(randomForest)
    # mydata is a placeholder for a data frame with a factor column named Class
    rf <- randomForest(Class ~ ., data = mydata)
    # predict() without newdata returns the out-of-bag predictions
    table(predict(rf), mydata$Class)

Clustering

K-means

    # Drop the class label before clustering
    newiris <- iris[, -which(colnames(iris) == "Species")]
    # The outer parentheses print the result of the assignment
    (kc <- kmeans(newiris, 3))
    # Compare the clusters with the true species
    table(iris$Species, kc$cluster)
    plot(newiris[c("Sepal.Length", "Sepal.Width")], col = kc$cluster)
    points(kc$centers[, c("Sepal.Length", "Sepal.Width")], col = 1:3, pch = 8, cex = 2)

Hierarchical clustering

    # Cluster a random sample of 40 flowers
    idx <- sample(1:dim(iris)[1], 40)
    irisSample <- iris[idx, ]
    irisSample$Species <- NULL
    hc <- hclust(dist(irisSample), method = "average")
    plot(hc, hang = -1, labels = iris$Species[idx])

The basic hierarchical clustering functions in R are hclust, flashClust, agnes and diana. hclust and agnes perform agglomerative hierarchical clustering, while diana performs divisive hierarchical clustering. flashClust is a much faster (50-100 times) version of hclust. The pvclust package can be used to assess the uncertainty in hierarchical cluster analyses. It provides approximately unbiased p-values as well as bootstrap p-values.

Bootstrap analysis in hierarchical clustering

The pvclust package assesses the uncertainty in hierarchical cluster analysis by calculating a p-value for each cluster via multiscale bootstrap resampling.
The method provides two types of p-values. The approximately unbiased p-value (AU) is computed by multiscale bootstrap resampling. It is less biased than the second one, the bootstrap probability (BP), which is computed by normal bootstrap resampling.

    library(pvclust)  # Loads the required pvclust package.
    # Creates a sample data set.
    y <- matrix(rnorm(500), 50, 10,
                dimnames = list(paste("g", 1:50, sep = ""), paste("t", 1:10, sep = "")))
    # nboot = 10 keeps the example fast; the package default is 1000.
    pv <- pvclust(scale(t(y)), method.dist = "correlation", method.hclust = "complete", nboot = 10)

Time series analysis

Time series decomposition

Time series decomposition splits a time series into trend, seasonal, cyclical and irregular components.

    # decompose() returns the trend, seasonal and random components
    f <- decompose(AirPassengers)
    plot(f)

Time series forecasting

Time series forecasting predicts future events from known past data.

    fit <- arima(AirPassengers, order = c(1, 0, 0),
                 seasonal = list(order = c(2, 1, 0), period = 12))
    fore <- predict(fit, n.ahead = 24)
    # Error bounds at (approximately) the 95% confidence level
    U <- fore$pred + 2 * fore$se
    L <- fore$pred - 2 * fore$se
    ts.plot(AirPassengers, fore$pred, U, L, col = c(1, 2, 4, 4), lty = c(1, 1, 2, 2))
    legend("topleft", c("Actual", "Forecast", "Error Bounds (95% Confidence)"),
           col = c(1, 2, 4), lty = c(1, 1, 2))

Bayesian network

From Wikipedia: Bayesian networks are used for modelling knowledge in computational biology and bioinformatics (gene regulatory networks, protein structure, gene expression analysis), medicine, document classification, information retrieval, image processing, data fusion, decision support systems, engineering, gaming and law.

deal is an R package for Bayesian network analysis.

Artificial neural network

Training a multi-layer perceptron is not a convex problem: many local minima can occur.

Among neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used unsupervised learning algorithms.

Support vector machines

The training problem is convex, which means that the global minimum will be reached.
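The practical meaning of convexity can be checked numerically. The following is a minimal sketch in base R, not a full SVM implementation: it assumes a linear model with an L2-regularised squared hinge loss (a smooth SVM variant chosen here so that optim() with BFGS applies), a two-class subset of the built-in iris data, and an arbitrary, untuned lambda = 0.1. The names objective, gradient and starts are illustrative only. Started from very different initial weights, the optimiser should end at the same objective value, up to its tolerance.

```r
# Two-class toy data: versicolor vs virginica from the built-in iris set
d <- iris[iris$Species != "setosa", ]
X <- cbind(1, scale(as.matrix(d[, 1:4])))   # first column is the bias term
y <- ifelse(d$Species == "versicolor", 1, -1)

# L2-regularised squared hinge loss (assumed variant): convex in w
objective <- function(w, lambda = 0.1) {
  slack <- pmax(0, 1 - y * as.vector(X %*% w))
  mean(slack^2) + lambda * sum(w[-1]^2)     # the bias w[1] is not penalised
}

# Analytical gradient of the objective above
gradient <- function(w, lambda = 0.1) {
  slack <- pmax(0, 1 - y * as.vector(X %*% w))
  as.vector(-2 * t(X) %*% (y * slack)) / length(y) + 2 * lambda * c(0, w[-1])
}

# Three very different starting points
set.seed(42)
starts <- list(rnorm(5, sd = 5), rnorm(5, sd = 5), rep(0, 5))
fits <- lapply(starts, function(w0) {
  optim(w0, objective, gradient, method = "BFGS",
        control = list(maxit = 1000, reltol = 1e-10))
})

# Objective value reached from each start, and the spread between them
values <- sapply(fits, function(f) f$value)
print(values)
print(diff(range(values)))
```

Repeating the same experiment with a multi-layer perceptron would generally give different final losses for different starts, which is the contrast drawn above.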
# Data preparation
## Generate random sets for training, model validation and testing
## Consider data normalization or scaling if appropriate (with continuous data, variables with small values may be neglected with respect to variables with large values)
# Binary vs multiclass: SVM, LS-SVM or KLR (kernel logistic regression)
# Probabilistic or not probabilistic: Bayesian approach, Platt's algorithm, isotonic regression
# Dealing with unbalanced data: prior probabilities, weighting, bias term correction
# Feature selection: filter methods, wrapper methods, embedded methods
## Do not use information from the test set
## Compare complex feature selection methods with simple and fast filter techniques
# Tuning parameters: kernel (linear, RBF, polynomial), tuning parameters (kernel, penalization), strategy (cross-validation, etc.)
## Do not tune too heavily: risk of overfitting
## Use a global optimization method or define a coarse grid of values for the tuning parameters on a log scale
## If necessary, fine-tune with a finer grid around the parameter values found previously
## The sigma parameter of an RBF kernel should scale with the input dimension
## Consider using a linear kernel when the input dimension is much larger than the number of data points
# Evaluation: measure (AUC, accuracy, error rate, etc.), strategy (cross-validation, repeated sampling), statistical testing (corrected tests, etc.)
## Measure performance with AUC
## Compare the performance of different algorithms
## Consider collecting additional test data after the classifier has been developed, and learn new models if necessary (whether performance is good enough is, sadly, a subjective matter)

SVM vs LS-SVM

LS-SVM is faster to train than SVM because a system of linear equations is solved instead of a quadratic programming problem.
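In the standard LS-SVM classifier formulation (equality constraints with a squared error term), training reduces to a single call to solve(). The following is a minimal base R sketch under that assumption. The helper rbf_kernel and the values gamma_reg = 10 and sigma = 2 are illustrative, untuned assumptions, and the two-class iris subset is only a convenient toy data set.

```r
# Two-class toy problem: versicolor vs virginica from the built-in iris data
d <- iris[iris$Species != "setosa", ]
X <- scale(as.matrix(d[, 1:4]))
y <- ifelse(d$Species == "versicolor", 1, -1)
n <- nrow(X)

# RBF kernel matrix between the rows of A and the rows of B (hypothetical helper)
rbf_kernel <- function(A, B, sigma) {
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-d2 / (2 * sigma^2))
}

gamma_reg <- 10   # regularisation constant (assumed, not tuned)
sigma <- 2        # RBF kernel width (assumed, not tuned)

K <- rbf_kernel(X, X, sigma)
Omega <- (y %o% y) * K

# LS-SVM training system:
#   [ 0   y'                   ] [ b     ]   [ 0 ]
#   [ y   Omega + I/gamma_reg  ] [ alpha ] = [ 1 ]
A <- rbind(c(0, y), cbind(y, Omega + diag(n) / gamma_reg))
rhs <- c(0, rep(1, n))
sol <- solve(A, rhs)
b <- sol[1]
alpha <- sol[-1]

# Decision function on the training data
pred <- sign(as.vector(K %*% (alpha * y)) + b)
print(mean(pred == y))            # training accuracy
print(sum(abs(alpha) > 1e-8))     # number of points with a non-zero alpha
```

The last line counts how many training points carry a non-zero coefficient, which can be compared with the total number of training points n.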
The trade-off is that all data points are support vectors, so the solution is not sparse.

Genetic algorithms

Association rules learning

Miscellaneous

10 best-known algorithms

An IEEE contest declared that the 10 best-known algorithms were:

# C4.5
# The k-Means algorithm
# Support Vector Machines
# The Apriori algorithm
# Expectation-Maximization
# PageRank
# AdaBoost
# k-Nearest Neighbor Classification
# Naive Bayes
# CART (Classification and Regression Trees)

Some good websites

* Slides on several statistical methods: [http://www.autonlab.org/tutorials/dtree.html Decision Trees], Information Gain, [http://www.autonlab.org/tutorials/prob.html Probability for Data Miners], Neural Networks, etc.
* Slides of 10+ talks at R Users Groups in Australia here
* ARMiner is a client-server data mining application specialized in finding association rules.
* Free Data Mining Tools
* Predicting Stock Market Returns
* Questions/Answers about machine learning
* sofia-ml: the C++ code is highly readable, and it supports classification, regression and ranking with SGD, Pegasos, Passive-Aggressive and other algorithms.
* Bolt: binary classification with Stochastic Gradient Descent or Pegasos, multi-class classification (One-versus-all, Averaged Perceptron, Maximum Entropy)
* Shogun: A Large Scale Machine Learning Toolbox
* Tutorials
* http://cran.r-project.org/web/views/Bayesian.html